ABSTRACT

In this paper we present TDLeaf(λ), a variation on the TD(λ) algorithm that enables it to be used
in conjunction with minimax search. We present some experiments in both chess and backgammon
which demonstrate its utility and provide comparisons with TD(λ) and another less radical variant,
TD-directed(λ). In particular, our chess program, "KnightCap," used TDLeaf(λ) to learn its evaluation
function while playing on the Free Internet Chess Server (FICS, fics.onenet.net). It improved
from a 1650 rating to a 2100 rating in just 308 games. We discuss some of the reasons for this success
and the relationship between our results and Tesauro's results in backgammon.



1. Introduction 

TD(λ), developed by Sutton [H], has its roots in the
learning algorithm of Samuel's checkers program fl]. It 
is an elegant algorithm for approximating the expected 
long term future cost of a stochastic dynamical system 
as a function of the current state. The mapping from 
states to future cost is implemented by a parameterised 
function approximator such as a neural network. The 
parameters are updated online after each state transition, 
or in batch updates after several state transitions. The 
goal of the algorithm is to improve the cost estimates as 
the number of observed state transitions and associated 
costs increases. 

Tesauro's TD-Gammon is perhaps the most remarkable
success of TD(λ). It is a neural network backgammon
player that has proven itself to be competitive with
the best human backgammon players [||. 

Many authors have discussed the peculiarities of
backgammon that make it particularly suitable for Temporal
Difference learning with self-play [Q, |[ |^]. Principal
among these are speed of play: TD-Gammon learnt
from several hundred thousand games of self-play;
representation smoothness: the evaluation of a backgammon
position is a reasonably smooth function of the
position (viewed, say, as a vector of piece counts), making
it easier to find a good neural network approximation;
and stochasticity: backgammon is a random game,
which forces at least a minimal amount of exploration
of the search space.

As TD-Gammon in its original form only searched
one ply ahead, we feel this list should be extended
with: shallow search is good enough against humans.
There are two possible reasons for this: either
one does not gain a lot by searching deeper in
backgammon (questionable, given that recent versions
of TD-Gammon search to three-ply for a significant
performance improvement), or humans are incapable of
searching deeply and so TD-Gammon is only competing
in a pool of shallow searchers.

In contrast, finding a representation for chess, Othello
or Go which allows a small neural network to order
moves at one-ply with near human performance is a
far more difficult task [^[ [II], H]. For these games,
reliable tactical evaluation is difficult to achieve without
deep search. This requires an exponential increase in
the number of positions evaluated as the search depth
increases. Consequently, the computational cost of the
evaluation function has to be low, and hence most chess
and Othello programs use linear functions.

In the next section we look at reinforcement learning
(the broad category into which TD(λ) falls), and then in
subsequent sections we look at TD(λ) in some detail and
introduce two variations on the theme: TD-directed(λ)
and TDLeaf(λ). The first uses minimax search to generate
better training data, and the second, TDLeaf(λ),
is used to learn an evaluation function for use in deep
minimax search.

2. Reinforcement Learning 

The popularly known and best understood learning tech- 
niques fall into the category of supervised learning. 
This category is distinguished by the fact that for each 
input upon which the system is trained, the "correct" 
output is known. This allows us to measure the error 
and use it to train the system. 

For example, if our system maps input $X_i$ to output
$Y'_i$, then, with $Y_i$ as the "correct" output, we can use
$(Y'_i - Y_i)^2$ as a measure of the error corresponding to
$X_i$. Summing this value across a set of training examples
yields an error measure of the form $\sum_i (Y'_i - Y_i)^2$,
which can be used by training techniques such as back
propagation.
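As a small illustration of this error measure, the following sketch sums the squared differences over a handful of training examples. The function name and the numbers are hypothetical and chosen only for the example.

```python
def sum_squared_error(predicted, correct):
    """Sum of squared differences between predicted and correct outputs."""
    if len(predicted) != len(correct):
        raise ValueError("predicted and correct must have the same length")
    return sum((p - c) ** 2 for p, c in zip(predicted, correct))


# Hypothetical outputs for three training examples.
predicted = [0.5, 1.0, -1.0]
correct = [1.0, 1.0, 0.0]
error = sum_squared_error(predicted, correct)  # 0.25 + 0.0 + 1.0
```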

Reinforcement learning differs substantially from
supervised learning in that the "correct" output is not
known. Hence, there is no direct measure of error;
instead, a scalar reward is given for the responses to a
series of inputs.

Consider an agent reacting to its environment (a
generalisation of the two-player game scenario). Let $S$
denote the set of all possible environment states. Time
proceeds with the agent performing actions at discrete
time steps $t = 1, 2, \ldots$. At time $t$ the agent finds the
environment in state $x_t \in S$, and has available a set of
actions $A_{x_t}$. The agent chooses an action $a_t \in A_{x_t}$,
which takes the environment to state $x_{t+1}$ with
probability $p(x_t, x_{t+1}, a_t)$. After a determined series of
actions in the environment, perhaps when a goal has been
achieved or has become impossible, the scalar reward
$r(x_N)$, where $N$ is the number of actions in the series, is
awarded to the agent. These rewards are often discrete,
e.g. "1" for success, "-1" for failure, and "0" otherwise.

For ease of notation we will assume all series of
actions have a fixed length of $N$ (this is not essential). If
we assume that the agent chooses its actions according
to some function $a(x)$ of the current state $x$ (so that
$a(x) \in A_x$), the expected reward from each state $x \in S$
is given by

$$J^*(x) := E_{x_N \mid x}\, r(x_N), \tag{1}$$

where the expectation is with respect to the transition
probabilities $p(x_t, x_{t+1}, a(x_t))$.

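Once we have $J^*$, we can ensure that actions are
chosen optimally in any state by using the following
equation to minimise the expected reward for the
environment, i.e. the other player in the game:

$$a^*(x) := \arg\min_{a \in A_x} J^*(x'_a), \tag{2}$$

where $x'_a$ denotes the state reached by taking action $a$
in state $x$ (assumed here to be determined by $a$).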
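The selection rule in equation (2) can be sketched as a one-line search over the available actions. This is a minimal illustration, not the authors' code: the callables `actions`, `successor` and `value`, and the toy problem, are assumptions made for the example.

```python
def greedy_action(state, actions, successor, value):
    """Return the action whose successor state has the lowest value.

    `value` scores a state from the point of view of the player who moves
    next (the opponent), so the agent picks the minimum.
    """
    return min(actions(state), key=lambda a: value(successor(state, a)))


# Hypothetical toy problem: states are integers, actions add 1, 2 or 3.
toy_values = {1: 0.4, 2: -0.7, 3: 0.1}
best = greedy_action(
    0,
    actions=lambda s: [1, 2, 3],
    successor=lambda s, a: s + a,
    value=lambda s: toy_values[s],
)
```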



For very large state spaces $S$ it is not possible to store
the value of $J^*(x)$ for every $x \in S$, so instead we
might try to approximate $J^*$ using a parameterised function
class $J : S \times \mathbb{R}^k \to \mathbb{R}$, for example linear
functions, splines, neural networks, etc. $J(\cdot, w)$ is assumed
to be a differentiable function of its parameters
$w = (w_1, \ldots, w_k)$. The aim is to find $w$ so that $J(x, w)$ is
"close to" $J^*(x)$, at least in so far as it generates the
correct ordering of moves.

This approach to learning is quite different from that 
of supervised learning where the aim is to minimise an 
explicit error measurement for each data point. 

Another significant difference between the two
paradigms is the nature of the data used in training.
With supervised learning it is fixed, whilst with
reinforcement learning the states which occur during training
are dependent upon the agent's choice of action, and
thus on the training algorithm which is modifying the
agent. This dependency complicates the task of proving
convergence for TD(λ) in the general case [^].

3. The TD(λ) algorithm

Temporal Difference learning, or TD(λ), is perhaps the
best known of the reinforcement learning algorithms. It
provides a way of using the scalar rewards such that
existing supervised training techniques can be used to
tune the function approximator. Tesauro's TD-Gammon,
for example, uses back propagation to train a neural network
function approximator, with TD(λ) managing this
process and calculating the necessary error values.

Here we consider how TD(λ) would be used to train
an agent playing a two-player game, such as chess or 
backgammon. 

Suppose $x_1, \ldots, x_{N-1}, x_N$ is a sequence of states in
one game. For a given parameter vector $w$, define the
temporal difference associated with the transition
$x_t \to x_{t+1}$ by

$$d_t := J(x_{t+1}, w) - J(x_t, w). \tag{3}$$



Note that $d_t$ measures the difference between the reward
predicted by $J(\cdot, w)$ at time $t + 1$, and the reward
predicted by $J(\cdot, w)$ at time $t$. The true evaluation function
$J^*$ has the property

$$E_{x_{t+1} \mid x_t}\left[ J^*(x_{t+1}) - J^*(x_t) \right] = 0,$$

so if $J(\cdot, w)$ is a good approximation to $J^*$, $E_{x_{t+1} \mid x_t}\, d_t$
should be close to zero. For ease of notation we will
assume that $J(x_N, w) = r(x_N)$ always, so that the final
temporal difference satisfies

$$d_{N-1} = J(x_N, w) - J(x_{N-1}, w) = r(x_N) - J(x_{N-1}, w).$$

That is, $d_{N-1}$ is the difference between the true outcome
of the game and the prediction at the penultimate
move.
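The temporal differences for one game can be computed from the sequence of predictions and the final outcome. The sketch below is only an illustration: the function name and the sample values are hypothetical, and the value of the final state is taken to be the game outcome, as assumed in the text.

```python
def temporal_differences(values, final_reward):
    """Return [d_1, ..., d_{N-1}] from the predictions for x_1 .. x_{N-1}.

    `values` holds the predictions for the non-terminal states; the value
    of the final state x_N is taken to be the outcome r(x_N).
    """
    targets = values[1:] + [final_reward]
    return [nxt - cur for cur, nxt in zip(values, targets)]


# Hypothetical predictions for a four-state game that ends in a win (reward 1).
diffs = temporal_differences([0.0, 0.25, 0.5], final_reward=1.0)
```

Because the differences telescope, their sum equals the outcome minus the first prediction.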

At the end of the game, the TD(λ) algorithm updates
the parameter vector $w$ according to the formula

$$w := w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w) \left[ \sum_{j=t}^{N-1} \lambda^{j-t} d_j \right], \tag{4}$$



where $\nabla J(\cdot, w)$ is the vector of partial derivatives of $J$
with respect to its parameters. The positive parameter
$\alpha$ controls the learning rate and would typically be
"annealed" towards zero during the course of a long series
of games. The parameter $\lambda \in [0, 1]$ controls the extent
to which temporal differences propagate backwards in
time. To see this, compare equation (4) for $\lambda = 0$:



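$$\begin{aligned}
w &:= w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w)\, d_t \\
  &= w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w) \left[ J(x_{t+1}, w) - J(x_t, w) \right]
\end{aligned} \tag{5}$$

and $\lambda = 1$:

$$w := w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w) \left[ r(x_N) - J(x_t, w) \right]. \tag{6}$$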
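To make the update concrete, here is a minimal sketch for the special case of a linear evaluator, where the prediction is a dot product of the weights with a feature vector and the gradient is the feature vector itself. The function name, the feature vectors and the numbers are assumptions for illustration only; the value of the final state is taken to be the game outcome.

```python
def td_lambda_update(weights, features, final_reward, alpha, lam):
    """One end-of-game TD(lambda) update for a linear evaluator.

    features[t] is the feature vector of the t-th non-terminal state
    (x_1 .. x_{N-1}); the value of x_N is taken to be final_reward.
    """
    values = [sum(w * f for w, f in zip(weights, phi)) for phi in features]
    targets = values[1:] + [final_reward]
    diffs = [nxt - cur for cur, nxt in zip(values, targets)]

    new_weights = list(weights)
    n = len(diffs)
    for t in range(n):
        # Discounted sum of the temporal differences from time t onwards.
        credit = sum(lam ** (j - t) * diffs[j] for j in range(t, n))
        for i, f in enumerate(features[t]):
            # For a linear evaluator the gradient of J(x_t, w) is features[t].
            new_weights[i] += alpha * credit * f
    return new_weights


# Hypothetical two-feature game with outcome 1.
start = [0.0, 0.0]
phis = [[1.0, 0.0], [0.0, 1.0]]
w_lam0 = td_lambda_update(start, phis, final_reward=1.0, alpha=0.5, lam=0.0)
w_lam1 = td_lambda_update(start, phis, final_reward=1.0, alpha=0.5, lam=1.0)
```

With `lam=0.0` only the last state, whose successor is the outcome, receives credit; with `lam=1.0` every state is moved towards the final outcome.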



Consider each term contributing to the sums in equations
(5) and (6). For $\lambda = 0$ the parameter vector is
being adjusted in such a way as to move $J(x_t, w)$,
the predicted reward at time $t$, closer to $J(x_{t+1}, w)$,
the predicted reward at time $t + 1$. In contrast, TD(1)
adjusts the parameter vector in such a way as to move
the predicted reward at time step $t$ closer to the final
reward at time step $N$. Values of $\lambda$ between zero and one
interpolate between these two behaviours. Note that (6)
is equivalent to gradient descent on the error function

$$E(w) := \sum_{t=1}^{N-1} \left[ r(x_N) - J(x_t, w) \right]^2.$$
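As a brief check of this equivalence (a worked step added for illustration, writing $\alpha$ for the learning rate and taking the sum over $t = 1, \ldots, N-1$), differentiating the error function with respect to $w$ gives

$$\nabla E(w) = -2 \sum_{t=1}^{N-1} \nabla J(x_t, w) \left[ r(x_N) - J(x_t, w) \right],$$

so a gradient descent step with step size $\alpha / 2$,

$$w := w - \frac{\alpha}{2} \nabla E(w) = w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w) \left[ r(x_N) - J(x_t, w) \right],$$

is the TD(1) update. The bracketed term is also what the inner sum of temporal differences telescopes to when $\lambda = 1$: $\sum_{j=t}^{N-1} d_j = J(x_N, w) - J(x_t, w) = r(x_N) - J(x_t, w)$.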

Tesauro [0, ^ and those who have replicated his work 
with backgammon report that the results are insensitive
to the value of λ, and commonly use a value around 0.7.
Recent work by Beale and Smith [[!]], however, suggests
that in the domain of chess there is greater sensitivity
to the value of λ, with it perhaps being profitable to
tune λ dynamically.

Successive parameter updates according to the TD(λ)
algorithm should, over time, lead to improved predictions
of the expected reward $J(\cdot, w)$. Provided the actions
$a(x_t)$ are independent of the parameter vector $w$,
it can be shown that for linear $J(\cdot, w)$, the TD(λ) algorithm
converges to a near-optimal parameter vector [10].
Unfortunately, there is no such guarantee if $J(\cdot, w)$ is
non-linear [10], or if $a(x_t)$ depends on $w$ [0].

4. Two New Variants 

For argument's sake, assume any action $a$ taken in state
$x$ leads to a predetermined state, which we will denote
by $x'_a$. Once an approximation $J(\cdot, w)$ to $J^*$ has been
found, we can use it to choose actions in state $x$ by
picking the action $a \in A_x$ whose successor state $x'_a$
minimizes the opponent's expected reward¹:

$$a(x) := \arg\min_{a \in A_x} J(x'_a, w). \tag{7}$$



This was the strategy used in TD-Gammon. Unfortunately,
for games like Othello and chess it is difficult to
accurately evaluate a position by looking only one move
or ply ahead. Most programs for these games employ
some form of minimax search. In minimax search, one
builds a tree from position $x$ by examining all possible
moves for the computer in that position, then all possible
moves for the opponent, and then all possible moves
for the computer and so on to some predetermined depth
$d$. The leaf nodes of the tree are then evaluated using
a heuristic evaluation function (such as $J(\cdot, w)$), and
the resulting scores are propagated back up the tree by
choosing at each stage the move which leads to the best
position for the player on the move. See Figure 1 for
an example game tree and its minimax evaluation. With
reference to the figure, note that the evaluation assigned
to the root node is the evaluation of the leaf node of the
principal variation: the sequence of moves taken from
the root to the leaf if each side chooses the best available
move.

¹ If successor states are only determined stochastically by the
choice of $a$, we would choose the action minimizing the expected
reward over the choice of successor states.
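The propagation rule, and the fact that the root value is the evaluation of one particular leaf, can be sketched as follows. This is a minimal full-breadth minimax over a hand-built tree, not KnightCap's search: the nested-list tree encoding, the leaf labels and the scores are all hypothetical and do not reproduce the values in Figure 1.

```python
def minimax(node, maximizing, evaluate):
    """Return (value, leaf) for a game tree given as nested lists.

    Internal nodes are lists of children; anything else is a leaf position.
    `leaf` is the leaf at the end of the principal variation.
    """
    if not isinstance(node, list):
        return evaluate(node), node
    results = [minimax(child, not maximizing, evaluate) for child in node]
    pick = max if maximizing else min
    # Ties are broken in favour of the first child with the best value.
    return pick(results, key=lambda r: r[0])


# Hypothetical 3-ply tree with eight labelled leaves and made-up scores.
scores = {"l1": 3, "l2": -9, "l3": 10, "l4": 8, "l5": 4, "l6": 2, "l7": -5, "l8": 6}
tree = [[["l1", "l2"], ["l3", "l4"]], [["l5", "l6"], ["l7", "l8"]]]
value, leaf = minimax(tree, True, scores.get)
```

Returning the leaf alongside the value is a design choice made here because the leaf of the principal variation is exactly what a learning rule based on leaf positions needs to store.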

Our TD-directed(λ) variant utilises minimax search
by allowing play to be guided by minimax, but still defines
the temporal differences to be the differences in
the evaluations of successive board positions occurring
during the game, as per equation (3).

Let $J_d(x, w)$ denote the evaluation obtained for state
$x$ by applying $J(\cdot, w)$ to the leaf nodes of a depth $d$
minimax search from $x$. Our aim is to find a parameter
vector $w$ such that $J_d(\cdot, w)$ is a good approximation to
the expected reward $J^*$. One way to achieve this is to
apply the TD(λ) algorithm to $J_d(x, w)$. That is, for each
sequence of positions $x_1, \ldots, x_N$ in a game we define
the temporal differences

$$d_t := J_d(x_{t+1}, w) - J_d(x_t, w) \tag{8}$$

as per equation (3), and then the TD(λ) algorithm (4)
for updating the parameter vector $w$ becomes



$$w := w + \alpha \sum_{t=1}^{N-1} \nabla J_d(x_t, w) \left[ \sum_{j=t}^{N-1} \lambda^{j-t} d_j \right]. \tag{9}$$



One problem with equation (9) is that for $d > 1$,
$J_d(x, w)$ is not necessarily a differentiable function
of $w$ for all values of $w$, even if $J(\cdot, w)$ is everywhere
differentiable. This is because for some values of $w$
there will be "ties" in the minimax search, i.e. there
will be more than one best move available in some of
the positions along the principal variation, which means
that the principal variation will not be unique. Thus, the
evaluation assigned to the root node, $J_d(x, w)$, will be
the evaluation of any one of a number of leaf nodes.

Fortunately, under some mild technical assumptions
on the behaviour of $J(x, w)$, it can be shown that for
all states $x$ and for "almost all" $w \in \mathbb{R}^k$, $J_d(x, w)$
is a differentiable function of $w$. Note that $J_d(x, w)$
is also a continuous function of $w$ whenever $J(x, w)$
is a continuous function of $w$. This implies that even
for the "bad" pairs $(x, w)$, $\nabla J_d(x, w)$ is only undefined
because it is multi-valued. Thus we can still arbitrarily
choose a particular value for $\nabla J_d(x, w)$ if $w$ happens to
land on one of the bad points.

Based on these observations we modified the TD(λ)
algorithm to take account of minimax search: instead of
working with the root positions $x_1, \ldots, x_N$, the TD(λ)
algorithm is applied to the leaf positions found by minimax
search from the root positions. We call this algorithm
TDLeaf(λ).
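A minimal sketch of this idea for a linear evaluator is given below. It assumes the principal-variation leaf for each root position has already been found by the search and stored as a feature vector, so that the evaluation of a root position and its gradient are those of its leaf. The function name, the feature vectors and the numbers are hypothetical, and this is not KnightCap's implementation.

```python
def tdleaf_update(weights, leaf_features, final_reward, alpha, lam):
    """End-of-game TDLeaf(lambda) update for a linear evaluator.

    leaf_features[t] is the feature vector of the principal-variation leaf
    found by searching from the t-th root position; the value of the final
    position is taken to be final_reward.
    """
    def evaluate(phi):
        return sum(w * f for w, f in zip(weights, phi))

    values = [evaluate(phi) for phi in leaf_features]  # search values of the roots
    targets = values[1:] + [final_reward]
    diffs = [nxt - cur for cur, nxt in zip(values, targets)]

    new_weights = list(weights)
    for t, phi in enumerate(leaf_features):
        credit = sum(lam ** (j - t) * diffs[j] for j in range(t, len(diffs)))
        for i, f in enumerate(phi):
            # The gradient of the search value is the gradient at the leaf.
            new_weights[i] += alpha * credit * f
    return new_weights


# Hypothetical game with two stored leaves and outcome 3.
updated = tdleaf_update(
    [1.0, 0.0], [[1.0, 0.0], [2.0, 1.0]], final_reward=3.0, alpha=0.5, lam=0.5
)
```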




Fig. 1: Full breadth, 3-ply search tree illustrating the minimax rule for propagating values. Each of the leaf nodes (H-O) is given a score by the
evaluation function, $J(\cdot, w)$. These scores are then propagated back up the tree by assigning to each opponent's internal node the minimum
of its children's values, and to each of our internal nodes the maximum of its children's values. The principal variation is then the sequence of
best moves for either side starting from the root node, and this is illustrated by a dashed line in the figure. Note that the score at the root node
A is the evaluation of the leaf node (L) of the principal variation. As there are no ties between any siblings, the derivative of A's score with
respect to the parameters $w$ is just $\nabla J(L, w)$.



5. Experiments with Chess 

In this section we describe several experiments in which
the TDLeaf(λ) and TD-directed(λ) algorithms were
used to train the weights of a linear evaluation function
for our chess program, called KnightCap.

For our main experiment we took KnightCap's evaluation
function and set all but the material parameters
to zero. The material parameters were initialised to the
standard "computer" values². With these parameter settings
KnightCap was started on the Free Internet Chess
Server (FICS, fics.onenet.net). To establish its
rating, 25 games were played without modifying the
evaluation function, after which it had a blitz (fast time
control) rating of 1650 ± 50³. We then turned on the
TDLeaf(λ) learning algorithm, with $\lambda = 0.7$ and the
learning rate $\alpha = 1.0$. The value of $\lambda$ was chosen
arbitrarily, while $\alpha$ was set high enough to ensure rapid
modification of the parameters.
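For illustration, a material-only linear evaluation of the kind used as the starting point can be sketched as a weighted sum of piece-count differences. The piece values are the ones quoted in the paper's footnote; the function name, the choice of piece-count differences as features and the example position are assumptions, since KnightCap's actual feature set is not described here.

```python
# Material values quoted in the paper's footnote, in units of one pawn.
MATERIAL = {"pawn": 1, "knight": 4, "bishop": 4, "rook": 6, "queen": 12}


def material_score(own_counts, opponent_counts, weights=MATERIAL):
    """Linear evaluation: weighted sum of piece-count differences."""
    return sum(
        weights[piece] * (own_counts.get(piece, 0) - opponent_counts.get(piece, 0))
        for piece in weights
    )


# Hypothetical imbalance: we have a rook for the opponent's bishop and pawn.
score = material_score({"rook": 1}, {"bishop": 1, "pawn": 1})
```

Setting every other parameter to zero, as in the experiment, leaves exactly this kind of sum as the whole evaluation.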

After only 308 games, KnightCap's rating climbed to 
2110 ± 50. This rating puts KnightCap at the level of 
US Master. 

We repeated the experiment using TD-directed(λ),
and observed a 200 point rating rise over 300 games:
a significant improvement, but slower than with TDLeaf(λ).

There are a number of reasons for KnightCap's re- 
markable rate of improvement. 

1. KnightCap started out with intelligent material
parameters. This put it close in parameter space to
many far superior parameter settings.

2. Most players on FICS prefer to play opponents 
of similar strength, and so KnightCap's opponents 
improved as it did. Hence it received both positive 
and negative feedback from its games. 

3. KnightCap was not learning by self-play. 

² 1 for a pawn, 4 for a knight, 4 for a bishop, 6 for a rook and 12
for a queen.

³ After some experimentation, we have estimated the standard
deviation of FICS ratings to be 50 ratings points.



To investigate the importance of some of these 
reasons, we conducted several more experiments. 

Good initial conditions. 

A second experiment was run in which KnightCap's co- 
efficients were all initialised to the value of a pawn. 

Playing with this initial weight setting KnightCap 
had a blitz rating of 1260 ± 50. After more than 1000 
games on FICS KnightCap's rating has improved to 
about 1540 ± 50, a 280 point gain. This is a much 
slower improvement than the original experiment, and 
makes it clear that starting near a good set of weights is 
important for fast convergence. 

Self-Play 

Learning by self-play was extremely effective for
TD-Gammon, but a significant reason for this is the
stochasticity of backgammon. Chess, however, is a deterministic
game, and self-play by a deterministic algorithm
tends to result in a large number of substantially similar
games. This is not a problem if the games seen in self-play
are "representative" of the games played in practice.
However, KnightCap's self-play games with only
non-zero material weights are very different to the kind
of games humans of the same level would play.

To demonstrate that learning by self-play for Knight- 
Cap is not as effective as learning against real oppo- 
nents, we ran another experiment in which all but the 
material parameters were initialised to zero again, but 
this time KnightCap learnt by playing against itself. Af- 
ter 600 games (twice as many as in the original FICS 
experiment), we played the resulting version against the 
good version that learnt on FICS, in a 100 game match 
with the weight values fixed. The FICS trained version
won 89 points to the self-play version's 11.

6. Backgammon Experiment 

For our backgammon experiment we were fortunate to 
have Mark Land (University of California, San Diego)
provide us with the source code for his LGammon pro-
gram which has been implemented along the lines of 
Tesauro's TD-Gammon[[7j ^J. 

Along with the code for LGammon, Land also pro- 
vided a set of weights for the neural network. The 
weights were used by LGammon when playing on the 
First Internet Backgammon Server (FIBS, fibs.com), 
where LGammon achieved a rating which ranged from 
1600 to 1680, significantly above the mean rating across 
all players of about 1500. For convenience, we refer to 
the weights as the FIBS weights. 

Using LGammon and the FIBS weights to directly 
compare searching to two-ply against searching to one- 
ply, we observed that two-ply is stronger by 0.25 points- 
per-game, a significant difference in backgammon. Fur- 
ther analysis showed that in 24% of positions, the move 
recommended by a two-ply search differed from that 
recommended by a one-ply search. 

Subsequently, we decided to investigate how well
TD-directed(λ) and TDLeaf(λ), both of which can
search more deeply, might perform. Our experiment
sought to determine whether either TD-directed(λ) or
TDLeaf(λ) could find better weights than standard
TD(λ).

To test this, we suitably modified the algorithms to
account for the stochasticity inherent in the game, and
took two copies of the FIBS weights, the end product
of a standard TD(λ) training run of 270,000 games. We
trained one copy using TD-directed(λ) and the other using
TDLeaf(λ). Each network was trained for 50,000
games and then played against the unmodified FIBS
weights for 1600 games, with both sides searching to
two-ply and the match score recorded.

The results fluctuated around parity with the FIBS
weights (the product of TD(λ) training), with no statistically
significant change in performance being observed.
This suggests that the solution found by TD(λ) is either
at or near the optimum for two-ply play.

7. Discussion and Conclusion 

We have introduced TDLeaf(λ), a variant of TD(λ) for
training an evaluation function used in minimax search.
The only extra requirement of the algorithm is that the
leaf nodes of the principal variations be stored throughout
the game.

We presented some experiments in which a chess 
evaluation function was trained by on-line play against a 
mixture of human and computer opponents. The exper- 
iments show both the importance of "on-line" sampling 
(as opposed to self-play), and the need to start near a 
good solution for fast convergence. 

We compared training using leaf nodes (TDLeaf(λ))
with training using root nodes, both in chess with a
linear evaluation function and 5-10 ply search, and in
backgammon with a one hidden layer neural-network
evaluation function and 2-ply search. We found a significant
improvement from training on the leaf nodes in chess,
which can be attributed to the substantially different
distribution over leaf nodes compared to root nodes.
No such improvement was observed for backgammon,
which suggests that the optimal network to use in 1-ply
search is close to the optimal network for 2-ply search.

On the theoretical side, it has recently been shown
that TD(λ) converges for linear evaluation functions
[10]. An interesting avenue for further investigation
would be to determine whether TDLeaf(λ) has similar
convergence properties.